{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Churn Prediction\n",
    "\n",
    "This notebook will introduce the use of the churn dataset to create churn prediction model using deep kernel learning.\n",
    "\n",
    "The dataset used to ingest is from SIDKDD 2009 competition. \n",
    "\n",
    "The pipeline is composed using Azure ML pipeline and trained on Azure ML compute with hyper parameters of the gaussian process and the neural network jointly tuned through hyperdrive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import os\n",
    "import urllib\n",
    "\n",
    "from azureml.core import  (Workspace,Run,VERSION,\n",
    "                           Experiment,Datastore)\n",
    "from azureml.core.runconfig import (RunConfiguration,\n",
    "                                    DEFAULT_GPU_IMAGE)\n",
    "from azureml.core.conda_dependencies import CondaDependencies\n",
    "from azureml.core.compute import (AmlCompute, ComputeTarget)\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "from azureml.data.data_reference import DataReference\n",
    "from azureml.pipeline.core import (Pipeline, \n",
    "                                   PipelineData)\n",
    "from azureml.pipeline.steps import (HyperDriveStep,PythonScriptStep)\n",
    "from azureml.train.dnn import PyTorch\n",
    "from azureml.train.hyperdrive import *\n",
    "from azureml.widgets import RunDetails\n",
    "\n",
    "\n",
    "print('SDK verison', VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variables declaration\n",
    "\n",
    "Declare variables to be used through out, please fill in the Azure subscription ID, resource-group and workspace name to connect to your Azure ML workspace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SUBSCRIPTION_ID = ''\n",
    "RESOURCE_GROUP = ''\n",
    "WORKSPACE_NAME = ''\n",
    "\n",
    "PROJECT_DIR = os.getcwd()\n",
    "EXPERIMENT_NAME = \"customer_churn\"\n",
    "CLUSTER_NAME = \"gpu-cluster\"\n",
    "\n",
    "DATA_DIR = os.path.join(PROJECT_DIR,'data')\n",
    "TRAIN_DIR = os.path.join(PROJECT_DIR,'code','train')\n",
    "PREPROCESS_DIR = os.path.join(PROJECT_DIR,'code','preprocess')\n",
    "\n",
    "SOURCE_URL ='https://amlgitsamples.blob.core.windows.net/churn'\n",
    "FILE_NAME = 'CATelcoCustomerChurnTrainingSample.csv'\n",
    "PYTORCH_SUPPORTED_VERSION = '1.1'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize workspace\n",
    "\n",
    "Initialize a workspace object "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ws = Workspace(workspace_name = WORKSPACE_NAME,\n",
    "               subscription_id = SUBSCRIPTION_ID ,\n",
    "               resource_group = RESOURCE_GROUP\n",
    "              )\n",
    "\n",
    "ws.write_config()\n",
    "\n",
    "print('Workspace loaded:', ws.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload  dataset to blob datastore\n",
    "\n",
    "Upload dataset to workspace default blob storage which will be mounted on AzureML compute during pipeline execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "default_store = default_datastore=ws.datastores[\"workspaceblobstore\"]\n",
    "default_store.upload(src_dir=DATA_DIR, target_path='churn', overwrite=True, show_progress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieve or create a Azure Machine Learning compute\n",
    "\n",
    "Here we create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.\n",
    "\n",
    "If you have already created an Azure ML compute in your workspace, just provide it's name in the cell below to have it used for Azure ML pipeline execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_name = \"cluster\"\n",
    "\n",
    "try:\n",
    "    cluster = ComputeTarget(ws, cluster_name)\n",
    "    print(cluster_name, \"found\")\n",
    "    \n",
    "except ComputeTargetException:\n",
    "    print(cluster_name, \"not found, provisioning....\")\n",
    "    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',max_nodes=4)\n",
    "\n",
    "    \n",
    "    cluster = ComputeTarget.create(ws, cluster_name, provisioning_config)\n",
    "    \n",
    "cluster.wait_for_completion(show_output=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline definition\n",
    "\n",
    "\n",
    "The Azure ML pipeline is composed of two steps: \n",
    " \n",
    " - Data pre-processing which consist of one-hot encoding categorical features, normalization of the features set, spliting of dataset into training/testing sets and finally writing out the output to storage.\n",
    " \n",
    " - Hyperdrive step that tune and train the deep kernel learning model using GPytorch and Pytorch estimator "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline data input/output\n",
    "\n",
    "Here, we define the input and intermediary dataset that will be used by the pipeline steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_dir = DataReference(datastore=default_store,\n",
    "                          data_reference_name=\"input_data\",\n",
    "                          path_on_datastore=\"churn\"\n",
    "                         )\n",
    "\n",
    "processed_dir = PipelineData(name = 'processed_data',\n",
    "                             datastore=default_store\n",
    "                            )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline 1st step: Data Preprocessing\n",
    "\n",
    "We start by defining the run configuration with the needed dependencies by the preprocessing step.\n",
    "\n",
    "In the cell that follow, we compose the first step of the pipeline.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cd = CondaDependencies()\n",
    "cd.add_conda_package('pandas')\n",
    "cd.add_conda_package('matplotlib')\n",
    "cd.add_conda_package('numpy')\n",
    "cd.add_conda_package('scikit-learn')\n",
    "\n",
    "\n",
    "run_config = RunConfiguration(framework=\"python\",\n",
    "                              conda_dependencies= cd)\n",
    "run_config.target = cluster\n",
    "run_config.environment.docker.enabled = True\n",
    "run_config.environment.docker.base_image = DEFAULT_GPU_IMAGE\n",
    "run_config.environment.python.user_managed_dependencies = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pre_processing = PythonScriptStep(\n",
    "                            name='preprocess dataset',\n",
    "                            script_name='preprocess.py',\n",
    "                            arguments=['--input_path', input_dir,\\\n",
    "                                         '--output_path', processed_dir],\n",
    "                            inputs=[input_dir],\n",
    "                            outputs=[processed_dir],\n",
    "                            compute_target=cluster_name,\n",
    "                            runconfig=run_config,\n",
    "                            source_directory=PREPROCESS_DIR\n",
    "                        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline second step: training\n",
    "\n",
    "For the second step, we start by defining the pytorch estimator that will be used to traing the Stochastic variational deep kernel learning model using Gpytorch.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimator = PyTorch(source_directory=TRAIN_DIR,\n",
    "                    conda_packages=['pandas', 'numpy', 'scikit-learn'],\n",
    "                    pip_packages=['gpytorch'],\n",
    "                    compute_target=cluster,\n",
    "                    entry_script='svdkl_entry.py',\n",
    "                    use_gpu=True,\n",
    "                    framework_version=PYTORCH_SUPPORTED_VERSION\n",
    "                   )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we configure Hyperdrive by defining the hyperparametes space and select choose Area under the curve as the metric to optimize for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ps = RandomParameterSampling(\n",
    "    {\n",
    "        '--batch-size': choice(4096,8192),\n",
    "        '--epochs': choice(500),\n",
    "        '--neural-net-lr': loguniform(-4,-2),\n",
    "        '--likelihood-lr': loguniform(-4,-2),\n",
    "        '--grid-size': choice(32,64),\n",
    "        '--grid-bounds': choice(-1,0),\n",
    "        '--latent-dim': choice(2),\n",
    "        '--num-mixtures': choice(4,6,8)\n",
    "    }\n",
    ")\n",
    "\n",
    "early_termination_policy = BanditPolicy(evaluation_interval=10, slack_factor=0.1)\n",
    "\n",
    "hd_config = HyperDriveConfig(estimator=estimator, \n",
    "                                hyperparameter_sampling=ps,\n",
    "                                policy=early_termination_policy,\n",
    "                                primary_metric_name='auc', \n",
    "                                primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, \n",
    "                                max_total_runs=12,\n",
    "                                max_concurrent_runs=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Last, we define the hyperdrive step of the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hd_step = HyperDriveStep(\n",
    "    name=\"hyper parameters tunning\",\n",
    "    hyperdrive_config=hd_config,\n",
    "    estimator_entry_script_arguments=['--data-folder', processed_dir],\n",
    "    inputs=[processed_dir])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build & Execute pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline(workspace=ws, steps=[hd_step],\n",
    "                    default_datastore=default_store\n",
    "                   )\n",
    "pipeline_run = Experiment(ws, 'Customer_churn').submit(pipeline,\n",
    "                                                      regenerate_outputs=True)\n",
    "RunDetails(pipeline_run).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.widgets import RunDetails\n",
    "RunDetails(pipeline_run).show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:amlenv]",
   "language": "python",
   "name": "conda-env-amlenv-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
